Text File | 1991-03-28 | 12KB | 283 lines

SPELCHEK Version 1.2 - A *FAST* spelling checker by Edwin Floyd.   3-28-91

Version 1.2 implements a new, faster dictionary algorithm which is
incompatible with previous versions.  Please rebuild all user
dictionaries with MAKEDICT.

SPELCHEK is distributed in three files:

   SP12EXE.ZIP - Executable programs
   SP12DCT.ZIP - Dictionaries (large file)
   SP12SRC.ZIP - TP6.0 source code to all programs

Purpose of SPELCHEK
-------------------

SPELCHEK extracts words from an input file, or several input files, and
checks them for membership in a superimposed code dictionary.  It writes
any words not found in the dictionary to an output file, one per line.
The program recognizes a number of options for:

   o  High-order bit stripping
   o  Appending additional information to the output word list
   o  Defining the characters comprising a "word"

How to run SPELCHEK
-------------------

From the DOS command line enter:

   SPELCHEK filenames [-H] [-M] [-W[+/-]abc..] [@name] [-Uname] [-Oname] [-Ppath]

Spaces delimit command line parameters.  You may intermingle input text
filenames and options (mark each option with a leading hyphen).
Filenames may include wild-cards.

Some options (-W,-O,-U) allow a character string or filename to follow
the option letter.  This must follow with no intervening spaces or the
program will mistake it for an input file name.

Some options (-H,-M) allow a "+" or "-" to indicate "on" or "off".  This
also must follow with no intervening space, and "+" is assumed if it is
omitted.

You may place options and filenames in an ASCII "include" file and
specify its name with a leading "@" on the command line.  An include
file may contain references to other include files.

You also may specify default options, filenames and include files in the
DOS environment using "SET SPELCHEK=...".  For example:

   SET SPELCHEK=-H+ -Owords.out -W-ABCDEFGHIJKLMNOPQRSTUVWXYZ
   SET SPELCHEK=@defaults.spc -O

SPELCHEK processes options left-to-right, first from the DOS
environment, then from the command line.
Where options conflict, the last option processed prevails.  Thus, you
may override "SET" environment options on the command line.

What the options mean
---------------------

-H[+/-]   Clear the high-order bit on each input character (default
          off).  Use this option to process files created by word
          processing programs, like WordStar, that mark some letters by
          setting the high-order bit, often at the beginning or end of
          a word.

-M[+/-]   Append markup information to output word list.  This causes
          the program to insert a number in front of each word written
          to the output file.  The number indicates the byte position
          in the input where the offending word begins.  The first byte
          in the input file is position 1.  Also, the program writes
          the file name at the beginning of the word list for each
          input file.  The file name is preceded by a zero and a space.
          This output file is intended as input to a program such as
          MARKDOC which marks misspelled words in the input document.

-P[path]  Indicate the drive and directory containing the master
          dictionary files.  There are seven master dictionary files:
          AB.DCT, CD.DCT, EH.DCT, IN.DCT, OR.DCT, ST.DCT and UZ.DCT.
          They all must reside in the same directory.  If no -P path is
          specified, the master dictionary files must reside in the
          current directory or the program directory.  The master
          dictionary files were created with MAKEDICT (see below) from
          a list of over a hundred thousand words obtained from Public
          Brand Software, 1-800-IBM-DISK.

-U[name]  Name a user dictionary file.  This option specifies the name
          of an existential dictionary file produced by the MAKEDICT
          program.  You may specify the drive and full path.  If a
          simple file name is specified, the file is assumed to be in
          the current directory.  If SPELCHEK can't open the user
          dictionary, it issues a warning message and processes the
          input files against the master dictionaries only.

-W-abc..  Replace the "word character set" with the indicated
          characters.
          The program checks each character in each input file for
          membership in the word character set and defines a "word" as
          an uninterrupted sequence of at least one but no more than
          255 characters which are members of that set.  The default is
          the set of upper and lower case alphabetic characters.

-W+abc..  Add characters to the word character set.

-O[name]  Name the output file.  If the name is omitted ("-O "), output
          goes to "StdOut" and is available for a DOS pipe (|) or
          redirection (>).  StdOut is the default.

-O-       Suppress output.  -Onul also suppresses output.  The program
          will still display word counts on the screen.

Three examples
--------------

1. Generate a list of all misspelled words in the document named
   MYSTORY.DOC and write the list to file MYSTORY.BWD.  The following
   are equivalent:

      SPELCHEK mystory.doc -Omystory.bwd

      SPELCHEK mystory.doc >mystory.bwd      (default StdOut)

      SET SPELCHEK=-Omystory.bwd             (set defaults)
      SPELCHEK mystory.doc

   If at this point we want an alphabetic, un-duplicated list of
   misspelled words, we can use the WORDS program (see WORDS.DOC for
   other uses):

      WORDS mystory.bwd -omystory.unq -a

2. Generate a list of misspelled words in the documents named
   HISPHYS.WS and OPREPORT.WS and use the list as input for MARKDOC to
   mark misspelled words in both documents.  The files are WordStar
   documents and we wish to check a user dictionary called MEDTERM.DCT
   in the current directory.  The main dictionary files reside in
   directory: D:\SPELL.

      SET SPELCHEK=-Pd:\spell -H -O -M -Umedterm.dct
      SPELCHEK hisphys.ws opreport.ws | MARKDOC

   We could have specified all the options on the command line.
   Ordinarily you should set the -P and -U options in the environment.

3. Generate an alphabetized, unduplicated list of misspelled words in
   all the documents in the C:\SPDOC directory.  Dictionaries and
   parameters are as in example two.
      SET SPELCHEK=-Pd:\spell -H -O -M -Umedterm.dct
      SPELCHEK c:\spdoc\*.doc -ospelchek.bwd
      WORDS spelchek.bwd -ounique.bwd -a

   File UNIQUE.BWD now contains the alphabetized list of unique,
   misspelled words from all *.DOC files in the directory.

Networks
--------

FYI, network users, SPELCHEK opens its input files in "Read, Deny None"
mode, @include files "Read, Compatibility", and the output file in
"Write, Compatibility".  Only one input file at a time is open, except
during processing of nested @include files.

MAKEDICT
--------

MAKEDICT creates an optimal existential dictionary (Bloom filter) which
can be used by SPELCHEK with the "-U" option (see above).  From the DOS
command line, enter:

   MAKEDICT infile [bits] [extra]

The input file should be a list of words, one per line.  All characters
should be upper case if the dictionary is intended for use with
SPELCHEK.

The second parameter, "bits", specifies the number of bits to
superimpose for each input word.  The number of bits partly determines
the accuracy of the dictionary.  For use with SPELCHEK, specify the
default, 14 bits.

The third parameter, "extra", specifies an allowance of extra space so
that words may be added to the dictionary while it still remains within
the accuracy specified by the "bits" parameter.  The default is zero.

The output file is given the same name as the input file, except the
extension is ".DCT".  If the input file extension is ".DCT", the output
file is given the extension ".DIC".

To create a user dictionary for SPELCHEK, only the input file need be
specified.  The defaults for "bits" and "extra" are exactly what is
required for a user dictionary.  Example:

   MAKEDICT medterm.lst

This creates a user dictionary called: MEDTERM.DCT suitable for use by
SPELCHEK.

MAKEDICT prints dictionary statistics, including the odds against
incorrectly recognizing a word which is not in the dictionary.
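
The superimposed-coding idea behind MAKEDICT can be illustrated with a
short Python sketch.  This is not the author's Turbo Pascal code: the
salted CRC-32 hash, the table sizing rule and the names make_dict,
in_dict and false_positive_rate are assumptions chosen for illustration,
and "extra" is treated here as a count of additional words, which the
text above does not confirm.  It shows only the general technique: each
word sets "bits" positions in a bit table, and a word is accepted when
all of its positions are set.

```python
import math
import zlib


def bit_positions(word, size_bits, bits_per_word=14):
    """Derive bits_per_word bit positions for a word.

    Assumption: each position is CRC-32 of the upper-cased word prefixed
    with a different one-byte salt.  SPELCHEK's real hash is not
    described in this document.
    """
    data = word.upper().encode("ascii", "ignore")
    return [zlib.crc32(bytes([salt]) + data) % size_bits
            for salt in range(bits_per_word)]


def make_dict(words, bits_per_word=14, extra_words=0):
    """Build a bit table sized so roughly half the bits end up set."""
    capacity = len(words) + extra_words
    wanted = math.ceil(capacity * bits_per_word / math.log(2))
    table = bytearray((max(wanted, 8) + 7) // 8)
    size_bits = len(table) * 8
    for word in words:
        for pos in bit_positions(word, size_bits, bits_per_word):
            table[pos // 8] |= 1 << (pos % 8)
    return table


def in_dict(table, word, bits_per_word=14):
    """True if every bit for the word is set (may be a false positive)."""
    size_bits = len(table) * 8
    return all(table[pos // 8] & (1 << (pos % 8))
               for pos in bit_positions(word, size_bits, bits_per_word))


def false_positive_rate(bits_per_word):
    """Chance a non-member passes when half the table bits are set."""
    return 0.5 ** bits_per_word
```

A word that was added is always found; a word that was not added is
accepted only if all of its bit positions happen to be set already, so
the error rate falls as "bits" grows, provided the table is large enough
to stay about half full.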
Please remember, a Bloom filter is a probabilistic technique; collisions
are possible, but you control the collision probability by the bits
setting.  All main dictionaries were created with 14 bits, corresponding
to about a 1/16384 chance of collision.  When you specify a user
dictionary, the odds increase to 1/16384 plus the user dictionary odds.
Thus, a 14-bit user dictionary would increase the odds of a collision to
about 1/8192.  This means, on the average, SPELCHEK will miss about one
out of every 8192 different misspelled words.  For instance, if a really
bad speller misspells (differently) about every tenth word in an
80,000-word document, SPELCHEK may miss one of the misspellings.

MARKDOC
-------

MARKDOC reads the output file produced by SPELCHEK with the -M+ option
and marks misspelled words in the input files.  From the DOS command
line, enter:

   MARKDOC [markchars] [<infile]

MARKDOC reads its standard input file (STDIN).  Each input line begins
with a number.  The number zero is always followed by a document file
name.  Each non-zero number indicates the position of the first
character of a misspelled word in the current document file.  MARKDOC
reads each document file and writes an output file which is the same as
the input file, except each misspelled word is preceded by "mark"
characters.  The default mark character is a single "#", but you may
specify mark characters as a parameter on the command line.  Examples:

   SPELCHEK document.fil -M+ | MARKDOC %@

   SPELCHEK -M+ document.fil -Omark.$$$
   MARKDOC <mark.$$$

MARKDOC saves a copy of the document file under the same name as the
original document except with the extension ".BAK".

Note: MARKDOC expects to read a file produced by SPELCHEK with the -M+
option.  If this option is not set, MARKDOC will abort with a Pascal
error 106.  MARKDOC is intended as a demonstration of one use of the
-M+ output file.  Its crash resistance should be improved before it's
let out into the real world.
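
As a rough illustration of how a program might consume the -M+ word
list, here is a minimal Python sketch of the marking step.  It is not
MARKDOC's source.  The function names parse_markup and mark_bytes are
hypothetical, the sketch assumes a single space separates the number
from the rest of each line, and it leaves out file handling and the
".BAK" copy.

```python
def parse_markup(lines):
    """Group 1-based word positions by document name.

    Assumed line format: "<number> <text>".  Number 0 introduces a file
    name; any other number is the byte position of a flagged word.
    """
    positions = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        number, _, text = line.partition(" ")
        if int(number) == 0:
            current = text.strip()
            positions.setdefault(current, [])
        elif current is None:
            raise ValueError("word position seen before any file name")
        else:
            positions[current].append(int(number))
    return positions


def mark_bytes(data, positions, mark=b"#"):
    """Return data with mark inserted before each 1-based position."""
    out = bytearray()
    last = 0
    for pos in sorted(set(positions)):
        start = pos - 1          # convert to a 0-based offset
        out += data[last:start] + mark
        last = start
    out += data[last:]
    return bytes(out)
```

For example, given the list lines "0 STORY.DOC", "5 qick" and "16 foxx",
parse_markup returns positions 5 and 16 for STORY.DOC, and mark_bytes
applied to the text "The qick brown foxx" places a "#" in front of
"qick" and "foxx".  A line that does not start with a number makes int()
raise an error, loosely mirroring the abort described in the note above.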
WORDS
-----

WORDS is a word extractor program useful for creating word lists for
MAKEDICT, among other things.  See WORDS.DOC for documentation.

Legal Stuff
-----------

SPELCHEK.EXE, MAKEDICT.EXE, MARKDOC.EXE, WORDS.EXE, SPELCHEK.DOC,
WORDS.DOC, and all source code files, dictionaries, and word lists are:

   Copyright (c) 1990,91 by Edwin T. Floyd, All rights reserved.

SPELCHEK is copyrighted "free" software.  The author hereby expressly
permits and encourages individuals to use SPELCHEK at home and at work
and to distribute it without charge.  The author prohibits distribution
of SPELCHEK for profit, or as a part of a product sold for profit,
except where explicit written permission has been obtained from the
author for such distribution.  Also, users groups and shareware
libraries charging a disk duplication fee not exceeding $10.00 may
distribute SPELCHEK.

The author makes no warranties of any kind, either expressed or implied,
as to merchantability or fitness for any particular purpose.  SPELCHEK,
et al., are available as is and in no event will the author be held
liable for damages, including any lost profits or incidental or
consequential damages, even if the author has been advised of the
possibility of such damages.

Authorship
----------

SPELCHEK was written in Turbo Pascal v6.0 by:

   Edwin T. Floyd [76067,747] (CompuServe)
   #9 Adams Park Court        404/576-3305 (work)
   Columbus, GA 31909         404/322-0076 (home)

The latest version of SPELCHEK is available on CompuServe in the IBMAPP
forum, and on a number of bulletin boards around the country.

- Edwin - 3-28-91

Revision History
----------------

05-13-90  V1.0  ETF  Original release & DDJ submission.
01-10-91  V1.1  ETF  Test version, Bloom filter CRC algorithm (not released)
03-28-91  V1.2  ETF  Update for TP6.0 and second public release